Abstract 


Theory of Mind is the ability to attribute mental states 
(beliefs, intents, knowledge, perspectives, etc.) to others 
and recognize that these mental states may differ from one’s 
own. Theory of Mind is critical to effective communication 
and to teams demonstrating higher collective performance. 
To effectively leverage the progress in Artificial Intelligence 
(AI) to make our lives more productive, it is important for 
humans and AI to work well together in a team. Tradition- 
ally, there has been much emphasis on research to make 
AI more accurate, and (to a lesser extent) on having it bet- 
ter understand human intentions, tendencies, beliefs, and 
contexts. The latter involves making AI more human-like 
and having it develop a theory of our minds. In this work, 
we argue that for human-AI teams to be effective, humans 
must also develop a theory of AI's mind (ToAIM), that is, get to 
know its strengths, weaknesses, beliefs, and quirks. We in- 
stantiate these ideas within the domain of Visual Question 
Answering (VQA). We find that using just a few examples 
(50), lay people can be trained to better predict responses 
and oncoming failures of a complex VQA model. We further 
evaluate the role existing explanation (or interpretability) 
modalities play in helping humans build ToAIM. Explainable 
AI has received considerable scientific and popular attention 
in recent times. Surprisingly, we find that having 
access to the model's internal states does not help people 
better predict its behavior. These states include its confidence 
in its top-k predictions and explicit or implicit attention maps, 
which highlight the regions in the image (and words in the 
question) that the model is looking at (and listening to) while 
answering a question about an image. 


1. Introduction 


The capacity to attribute mental states to other agents, 
states that may differ from one's own, is called Theory of 
Mind [45]. Effective communication involves consider- 
Figure 1: Current emphasis of research in human-AI collaboration 
is on AI modeling a human teammate's mental state (left top). 
We argue that for human-AI teams to be effective, humans must 
also have a model of the AI's strengths, weaknesses, and quirks. 
That is, humans must develop a Theory of AI’s Mind (left bottom). 
In this paper, we instantiate these ideas in the context of Visual 
Question Answering (right). Human subjects predict the success 
or failure (Failure Prediction), and output responses (Knowledge 
Prediction) of a VQA model (which we call Vicki). 


ing a teammate's background knowledge, abilities, prefer- 
ences and modifying one's interactions accordingly [27]. 
Indeed, recent studies [18, 62] conclude that the most effec- 
tive teams are those with members who, among other traits, 
demonstrated good Theory of Mind abilities. 


As AI progresses, we find ourselves working with AI 
agents increasingly often. Intelligent virtual assistants like 
Siri, Cortana, Google Assistant, and Alexa make our lives 
more convenient. Doctors collaborate with IBM's Wat- 
son [21, 51], dividing work based on their expertise to make 
better informed diagnoses [30]. Visually-impaired users are 
starting to rely on computer vision algorithms to interpret 
the world around them [63, 36, 10]. In-vehicle AI in autonomous 
cars leverages humans' experience to make decisions 
in unpredictable situations [60]. 


Clearly, in each of these cases it is critical for the hu- 
man to have a sense for what the AI is good at (vs not), 
or when the AI might fail and should not be trusted. The 
human-AI team will be more effective if the human collaborating 
with the AI agent has a deeper understanding of the 
AI agent's behavior. However, AI research has tradition- 
ally placed much of the burden on the AI to play its part 
in the team: to be more accurate [28, 48, 56, 49, 32], more 
human-like [3, 31, 40, 20, 2, 46, 42, 9, 11], understand our 
intentions [41, 57, 58], beliefs [19], tendencies [15], con- 
texts [47], and mental states [17, 16]. 


In this work, we argue that for human-AI teams to be ef- 
fective, Theory of Mind must go both ways. Humans must 
also understand the AI's beliefs, knowledge, and quirks 
(see Fig. 1). To clarify, we do not claim that a layperson 
should understand internal workings of the AI; only that 
building an intuition for its behavior is important in a col- 
laborative setting, particularly when the AI is imperfect. 
Today, Siri, Alexa, and image captioning systems are ex- 
amples of imperfect AI systems in production, while more 
sensitive applications (AI-assisted driving, diagnosis) are 
on the horizon. We propose that it is beneficial for humans to 
think of these as collaborations with an AI teammate and to work 
toward building Theory of (AI’s) Mind skills, from the 
standpoint of both effectiveness and safety (for when such 
imperfect systems fail). Such a skill is thus likely to be 
valuable to both designers of the AI and its users. 


Research suggests that to understand new entities such as 
a robot, humans project existing preconceptions and social 
constructs upon them [23]. However, as other recent works 
have shown [14, 24], the behavior of an AI agent is often 
quite different from that of a human — sometimes in ways 
that are surprising. Thus, inferences based on existing so- 
cial constructs or preconceptions may fail while estimating 
the behavior of AI agents. 


Of late, explainable AI has received considerable atten- 
tion from both the scientific community and popular me- 
dia. Our work evaluates the usefulness of such explanation 
modalities in the specific setting of human-AI teams. Most 
prior work on interpretability has focused on demonstrating 
the role of such explanations in improving trust. To the best 
of our knowledge, this is the first evaluation that measures 
the extent to which these modalities allow a human to build 
a mental model for the AI and predict its behavior. 


We consider an agent trained to perform the multi-modal 
task of Visual Question Answering (VQA) [4, 35]. Given an 
image and a free-form open-ended natural language ques- 
tion about the image, the AI agent's task is to answer the 
question accurately. We call this agent, a VQA model, Vicki. 
VQA is applicable to scenarios where humans (e.g., vi- 
sually impaired users, surveillance analysts, etc.) actively 
elicit information from visual data. It naturally lends itself 
to human-machine teams. The human teammates in our ex- 
periments are from Amazon Mechanical Turk (AMT). We 
consider two tasks that we believe demonstrate the degree 
to which a human understands their AI teammate Vicki: 
Failure Prediction and Knowledge Prediction. In Failure 
Prediction (FP), we show AMT subjects an image and a 
question about the image, and ask them to estimate if Vicki 
will correctly answer the question. Their ability to esti- 
mate Vicki's success or failure in a scenario is a measure of 
how well they understand its strengths and weaknesses. In 
Knowledge Prediction (KP), subjects are asked to estimate 
Vicki's exact response. Making an accurate estimation of 
the response of the agent requires a deeper understanding 
of its behavior. 


We study the extent to which humans can accurately esti- 
mate the behavior of Vicki. Then, we explicitly aid humans 
in developing a theory of Vicki's mind by (1) familiarizing 
them with Vicki's actual behavior during a training phase 
and (2) exposing them to Vicki's internal states via sev- 
eral existing 'explanation' modalities. We evaluate if these 
explanation modalities aid humans in accurately estimating 
Vicki's behavior (FP and KP). 


While Theory of (human) Mind might appear to involve 
many complex mental states beyond beliefs and knowledge 
(which are clearly applicable to AI), in practice, it is measured 
by a fairly simple test: "reading the mind in the 
eyes"² [8]. In a similar vein, we propose the two tasks of 
FP and KP as simple yet effective techniques to measure a 
person's understanding and estimation of an AI agent's behavior, 
i.e., their Theory of an AI's mind (ToAIM)³. 


Contributions. The contributions of this work are: 


1. We advocate a line of research to study the extent to 
which humans can build a Theory of AI's Mind (ToAIM) 
and develop approaches to aid the process. 

2. As a specific instantiation of this, we consider the problem 
of VQA where the AI's task is to answer a free-form 
natural language question about an image. 

3. We conduct large-scale human studies to measure the effectiveness 
of training, and of different explanation modalities, 
in helping humans accurately predict the successes, 
failures, and output responses of a VQA model on question-image 
pairs. To the best of our knowledge, this is the first 
evaluation that measures whether interpretability mechanisms 
do, in fact, allow humans to build a model of AI. 
Our human studies infrastructure will be made available. 

4. Our key findings are that (1) humans are indeed capable 
of predicting successes, failures, and outputs of the VQA 
model better than chance; (2) explicitly training humans to 
familiarize themselves with the model by using just a few 
examples improves their performance; and (3) existing explanation 
modalities do not enhance human abilities at predicting 
the model's behavior. 


²Involves looking at a photo of a human's eyes and choosing one of two 
adjectives that better describes the person's mental state. 

³On the subject of whether AI can have a mind at all, a number of 
philosophers suggest that it can. For instance, in ‘society of mind’ [38], 
Minsky says that a mind simply emerges as a result of complex interactions 
between many smaller non-intelligent entities which he calls agents. 


2. Related Work 


AI with a theory of (human) mind. A number of works 
in AI attempt to develop agents with an understanding of 
human characteristics and behavior. AI agents employing 
computer vision have been trained to predict the motiva- 
tions [57], intentions [41], actions [58], tendencies [15], 
contexts [47], etc., of humans. In addition, Scassellati [52] 
examines theories that explain the development of Theory 
of Mind in children and their applicability to building robots 
with similar capabilities. More recently, in the domain of 
abstract scenes, Eysenbach et al. [19] address the problem 
of identifying incorrect beliefs in people. The ability to 
identify false beliefs [61] in other agents is considered an 
important milestone in the development of Theory of Mind 
in an agent [7]. Unlike these works where AI agents “under- 
stand" humans, our work addresses the converse problem — 
to have humans understand AI agents, their quirks, weak- 
nesses, and beliefs. 

Explainable AI. Recently, there has been a thrust in the 
direction of “explainable” AI agents in vision-related tasks. 
Introspection vs Justification: Generating explanations 
for classification decisions has attracted considerable inter- 
est. Several works propose introspective explanations based 
on internal states of a decision process [67, 53, 26, 69], 
while others generate justifications consistent with model 
outputs [50, 29, 43]. Ribeiro et al. [50] explain the predictions 
of a classifier by learning an interpretable model 
locally around the prediction. Hendricks et al. [29] develop 
a justification system that produces explanations consistent 
with visual recognition decisions. Natural language vs Vi- 
sual explanations: Prior art has assessed the usefulness 
of natural language explanations of model decisions in im- 
proving model trust [29]. MacLeod et al. [34] investigate 
the role of phrasing of a model’s confidence in blind and 
visually impaired persons’ trust in image captioning mod- 
els. Park et al. [43] propose a pointing and justification 
model for VQA that can both justify predictions in natural 
language and also point to visual evidence. Explicit vs Im- 
plicit attention: There is a line of work in designing mod- 
els that explicitly attend to relevant parts of their input for 
vision tasks such as object recognition [5, 39], image cap- 
tioning [65, 13], and VQA [33, 66, 64]. In contrast, recent 
work by Zhou et al. [70] and Selvaraju et al. [53] expose 
implicit attention for predictions from CNN-based models 
as visual explanations. 

Across these works, the focus is on making AI agents more 
transparent and capable of explaining their decisions in or- 
der to build trust. In our work, we explore the role explana- 
tion modalities play in improving a human’s model of the 
AI, as measured by the human’s accuracy at predicting the 
AI's success, failure, and output responses. 

Failure Prediction. There exists prior art that deals with 
building models that predict failure modes of systems [6, 68]. 
Whereas these works employ statistical models to predict 
failure modes of a base system, we evaluate the role a 
training phase as well as explanation modalities play when 
humans perform the same task. In addition to predicting 
the success or failure of AI agents, we also train humans 
to more accurately predict the “knowledge”, i.e., the actual 
output of an AI agent. 

Humans adapting to technology. A few works [59, 44] 
observe human strategies while adapting to the limited ca- 
pabilities of an AI agent in interactive language games. For 
instance, in a human-AI game of charades, humans modify 
strategies such as word selection, turn length, and prosody, 
to adapt to the robot's limited perceptive abilities. While 
both these works observe that humans dynamically adapt 
their behavior while interacting with an AI on a particular 
task, in our work we explicitly measure to what extent hu- 
mans have formed an accurate model of the AI. We also 
evaluate the role that explanation/interpretability modalities 
play in helping humans build a more accurate model. 


3. Meet Vicki 


We instantiate the idea of humans building a Theory of 
AI's Mind in the VQA task. Our AI agent (that we call 
Vicki) is a VQA model trained to answer a free-form natural 
language question about an image. Concretely, we use the 
VQA model by Lu et al. [33]. It is a hierarchical coattention 
model that models the question at multiple levels of granu- 
larity (words, phrases, entire question) and at each level, has 
explicit attention mechanisms on the image (where to look) 
as well as the question (which words and phrases to listen 
to). Among the different variants introduced in [33], we 
use the alternating co-attention model trained with VGG- 
19 [55] as the CNN to derive image-representations. 

Vicki was trained on the VQA dataset [4] train split containing 
248,349 QI pairs, and outputs one of 1000 possible 
answers (the most frequent in the train split). Its accuracy 
on the VQA dataset (test-standard) is 62.2%⁴ (human accuracy 
is 83.3%), which was the state of the art at publication 
[33] and is still competitive today. Moreover, Vicki's 
image and question attention maps provide access to its 'internal 
states' while making a prediction. These maps highlight 
the regions of the image and words of the question that 
Vicki attends to. This presents an opportunity to assess the 
role such explanation modalities can play in helping humans 
better predict Vicki's behavior. Among the various settings 
explored in [33], we use the question-level image and question 
attention maps in our experiments. 

⁴http://www.visualqa.org/roe.html 

Vicki is Quirky. There are several factors that contribute to 
Vicki being quirky, in a predictable fashion. Some of these 
quirks are well-known in VQA literature [1]. Vision is not 
perfect: Vicki, like most other vision models, has a limited 
capability to understand the image. Observing Vicki's 
behavior during its failures demonstrates its quirks. For in- 
stance, when the question asks the color of a small object in 
the scene, say a soda can, Vicki may simply respond with 
the most dominant color in the scene. This is clearly evi- 
dent when we observe the distribution of Vicki's responses 
across a diverse set of images [1]. Language is not per- 
fect: Vicki has a limited capability to understand free-form 
natural language. Vicki seems to converge on a predicted 
answer after listening to just half the question 49% of the 
time [1]. So in many cases, it answers questions based 
only on the first few words of the question [1]. Vicki 
cannot reason: Vicki has no mechanism to leverage exter- 
nal knowledge and reason about common sense. Vicki is 
poor at compositionality — it is unable to disentangle and 
recompose concepts seen in training to generalize to unseen 
test concepts [1]. Vicki does not have an explicit counting 
mechanism [12]. So it often defaults to the popular answer 
“2” for "How many" questions. Vicki cannot say much: 
Since Vicki is a 1000-way classifier, it only has a fixed set 
of utterances. Vicki answers every question: Vicki was 
trained only on questions that were relevant to the image. 
Thus, Vicki does not know how to say “That doesn’t make 
sense.” or “There is no woman in this image.” when asked 
“What color is the woman's shirt?" on an image that does 
not contain a woman. Thus, when posed with a question 
that is irrelevant to the image, Vicki is forced to provide an 
answer from its limited vocabulary. Interestingly, because 
Vicki is a deterministic function of the question and image, 
observing its response across QI-pairs often gives us a sense 
for what it might be basing its responses on. Vicki may ig- 
nore the image: Vicki picks up on the language priors that 
are inherent in the world which are easier to leverage than 
complicated visual signals. For example, when the question 
“What color is the banana?" is asked, Vicki often ignores 
the image and answers "yellow". Vicki is biased: Vicki is 
very likely to answer “yes” to a yes/no question, and answer 
“white” to a “what color” question due to biases inherent in 
the VQA dataset that it was trained on [25]. 

To get a sense for this, see Fig. 2. The patterns are clear. In 
top-left, even when there is no grass, Vicki tends to latch on 
to one of the dominant colors in the image. For top-right, 
even when there are no people in the image, Vicki seems to 
respond with what people could plausibly do in the scene 
if they were present. A priori, one (especially lay people) 
may not expect this. But when exposed to several examples 
of Vicki's responses, it is conceivable that subjects may be- 
gin to have an understanding of Vicki's behavior and con- 
sequently form a theory of its mind. 


4. Meet the tasks 


We present two tasks that can measure a human's un- 
derstanding of the capabilities of an AI agent such as Vicki. 


[Figure 2 montages: "How many people are there?" (Vicki answers "4"); "What is the man holding?" (Vicki answers "Fire Hydrant")] 


Figure 2: These montages highlight some of Vicki's quirks. For 
a given question, Vicki has the same response to each image in a 
montage. Common visual patterns (that Vicki presumably picks 
up on) within each montage are evident. 


These tasks are especially relevant to human-AI teams since 
they are analogous to measuring if a human teammate's 
trust in an AI teammate is well-calibrated, and if a human 
can estimate the behavior of an AI in a specific scenario. 
Failure Prediction (FP). In this task, we study the abil- 
ity of a human to predict the success or failure of Vicki. 
That is, given an image and a question about the image, we 
measure how accurately a person can predict if Vicki will 
successfully answer the question. A person can presumably 
predict the failure modes of Vicki reasonably well if they 
have a good sense of Vicki’s strengths and weaknesses. A 
collaborator who performs well on this task can accurately 
determine whether they should trust Vicki’s response to a 
question about an image. Please see a snapshot of the FP 
interface in Fig. 3a. Note that we do not show the human 
what Vicki’s predicted answer is.⁵ 

Knowledge Prediction (KP). In this task, we measure the 
capability of a human to develop a deeper understanding of 
Vicki’s behavior. Given an image and a question, a person 
guesses Vicki’s exact response (answer) from a set of its 
output labels (vocabulary). Recall that Vicki can only say 
one of 1000 things in response to a question about an image. 
Please see a snapshot of the KP interface in Fig. 3b. 
We provide subjects a convenient dropdown interface with 
autocomplete to choose an answer from Vicki’s vocabulary 
of 1000 answers. 

⁵Otherwise, given an image and question from the VQA dataset, it 
would be trivial for the human to verify if Vicki’s predicted answer is right 
or wrong. See appendix for more details. 

(b) The Knowledge Prediction interface. 

Figure 3: (a) A person guesses if an AI agent (Vicki) will answer this question for this image correctly or wrongly. (b) A person guesses 
what Vicki's exact answer will be for this question for this image. 

In FP, a good understanding of Vicki’s strengths and 
weaknesses might lead to good human performance. However, 
KP requires a deeper understanding of Vicki’s behavior, 
rooted in its quirks and beliefs. In addition to reasoning 
about Vicki’s failure modes, one has to guess its exact response 
for a given question about an image. Note that KP 
measures subjects' ability to take reality (the image the sub- 
ject sees) and translate it to what Vicki might say. High per- 
formance at KP is likely to correlate to high performance 
at the reverse task — take what Vicki says and translate it 
to what the image really contains. This can be very helpful 
when the visual content (image) is not directly available to 
the user. Explicitly measuring this is part of future work. 
A person who performs well at KP has likely successfully 
modeled a more fine-grained behavior of Vicki than just 
modes of success or failure. In contrast to typical efforts 
where the goal is for AI to approximate human abilities, KP 
involves measuring a human's ability to approximate a neu- 
ral network's behavior! 


5. Perception of VQA 


To set the baseline, we measure people's current esti- 
mates about VQA models. To this end, we briefly intro- 
duce Vicki to subjects as an "AI trained to answer questions 
about images". We then ask subjects to use their current un- 
derstanding and expectation of what AI agents can do, to es- 
timate the behavior of Vicki. We study the ability of humans 
to estimate Vicki's behavior via the FP and KP tasks. For 
both tasks, we randomly sample questions from the set of 
~1,400 most frequent questions in the validation set of the 
VQA dataset [4]. A description of our experimental setup 
for each task follows. 


5.1. Failure Prediction (FP) 


In this task, we show subjects a question and an image 
on which this question was asked in the VQA dataset, and 
ask if they think Vicki's response to the question-image 
pair (QI-pair) would be correct or wrong. To get ground 
truth, similar to the VQA accuracy metric [4], we check if 
Vicki's response matched at least 3 of 10 human-provided 
answers in the VQA dataset. Overall, a total of 88 unique 
subjects participated in our study, providing responses on 
1000 QI-pairs. On average, subjects accurately guessed 
whether Vicki would answer the question correctly (success) 
or not (failure) 59.88% of the time. The accuracy of 
always guessing success is 61.52%. While subjects’ performance 
seems lower than this, once we normalize for the prior 
of each class (success vs. failure), always guessing success 
drops to 50% while humans are at 54.24%. This shows that 
even without prior exposure to Vicki, human subjects can 
predict its failure better than chance. 

Figure 4: Optimism regarding Vicki: people's estimate of the AI's 
success in answering a question, across different answer-types. 
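
The ground-truth rule and the prior-normalized comparison used for FP can be sketched as follows. This is a minimal illustration and not the study's code: the function names and toy labels are hypothetical, and reading "normalizing for the prior of each class" as the mean of per-class accuracies is an assumption.

```python
from collections import Counter


def vicki_is_correct(vicki_answer, human_answers, min_matches=3):
    """FP ground truth: success if Vicki's answer matches at least
    `min_matches` of the human-provided answers (3 of 10 in the text)."""
    return Counter(human_answers)[vicki_answer] >= min_matches


def accuracy(predictions, labels):
    """Plain fraction of predictions that equal the label."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def class_normalized_accuracy(predictions, labels):
    """Mean of per-class accuracies (assumed reading of 'normalizing
    for the prior of each class')."""
    per_class = []
    for cls in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == cls]
        per_class.append(sum(predictions[i] == cls for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): Vicki succeeds on 6 of 10 QI-pairs.
labels = [True] * 6 + [False] * 4
always_success = [True] * 10

print(accuracy(always_success, labels))                   # 0.6
print(class_normalized_accuracy(always_success, labels))  # 0.5
```

Under this reading, a strategy that always guesses one class scores exactly 50% regardless of how skewed the classes are, which is why the normalized number is the fairer point of comparison for subjects.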

We further measure people's optimism about Vicki's 
abilities. Fig. 4 shows the percentage of QI-pairs that sub- 
jects predicted Vicki would answer correctly for different 
answer types. We find that subjects expect Vicki to an- 
swer questions whose answers are numbers (e.g., count- 
ing questions) correctly quite often. Interestingly, today's 
VQA models are in fact quite ineffective at counting. The 
VQA leaderboard shows significant improvements in performance 
on "other" questions over time, but improvements 
on “number” questions have stalled. Overall, subjects 
demonstrated an average optimism (measured as the % of 
"correctly" (success) predictions) of 75.46%. 


5.2. Knowledge Prediction (KP) 


In the KP task, we ask subjects what they think Vicki 
would say in response to a question about an image. Note 
that the VQA dataset only contains questions about an im- 
age that are relevant to the image, as annotators were look- 
ing at the image while asking questions. So a question 
“What color is the man's shirt?” would only be asked for an 
image that contains a man wearing a shirt. 


As an interesting twist intended to elicit Vicki's quirky 
behavior described in Sec. 3, we also paired images with 
random (and likely irrelevant [46]) questions (e.g., “What 
are the people doing?" on an image that may not con- 
tain people). Recall that Vicki is forced to respond with 
a limited vocabulary (one of 1000 answers). These sam- 
ples are useful to measure a person's understanding of an 
agent's responses to any given stimulus — including those 
that come from a distribution under which the agent has not 
been trained. Note that FP cannot be evaluated on irrelevant 
images. The notion of a "correct" answer is ill-defined if a 
question is not relevant to an image. 

We performed the KP task⁶ on 1000 QI-pairs (700 relevant 
and 300 irrelevant). We collected 25 responses to each 
pair. A total of 173 unique subjects participated in our study. 
The accuracy achieved by predicting Vicki's most popular 
answer (‘yes’) is 15.79%. We found that subjects were able 
to accurately predict Vicki's response 24.81% of the time. 


6. Familiarizing people with Vicki 


In this section we describe our experimental setup to fa- 
miliarize subjects with Vicki's behavior. We approach this 
in two ways – by providing instant feedback about Vicki's 
actual behavior on each QI pair once the subject responds, 
and by exposing subjects to various explanation modalities 
that reveal Vicki's internal states. 

Challenges. Collecting data for this setup is challenging 
for a couple of reasons: (1) Each subject has to go through 
a training phase to become familiar with Vicki before we 
can test them. This results in each task on AMT being un- 
usually long and expensive. It also reduces the subject pool 
down to those willing to participate in long tasks. (2) Once 
a subject does one task for us, they cannot do another task 
because the training / exposure to Vicki would leak over. 
This means we need as many subjects as tasks. This makes 
data collection quite slow. In light of these challenges, to 
systematically evaluate the roles of training and exposure to 
Vicki's internal states, we focus on a small set of questions. 
Data. We identify a subset of questions in the VQA [4] 
validation split that occur more than 100 times. We select 
7 diverse questions from this subset that are representative 
of the different types of questions (counting, yes/no, color, 
scene layout, activity, etc.) in the dataset⁷. For each of the 
7 questions, we then sample a set of 100 images. For FP, 
the 100 images per question are random samples from the 
set of images on which the question was asked in the VQA 
validation split (VQA-val). For the KP task, these 100 im- 
ages are random images from VQA-val. Ray et al. [46] 


⁶To control for familiarity, we ensured that subjects who perform a KP 
task are not allowed to perform an FP task. 

⁷What kind of animal is this? What time is it? What are the people 

Figure 5: Average performance across subjects for both tasks: 
Failure Prediction (FP) and Knowledge Prediction (KP), with and 
without instant feedback (IF), and with various explanation modal- 
ities. Error bars are 95% confidence intervals from 1000 boot- 
strap samples. Note that the dotted lines are various machine ap- 
proaches applied to FP. 


found that randomly pairing an image with a question in 
the VQA dataset results in about 79% of pairs being irrele- 
vant. Recall that this combination of relevant and irrelevant 
question-image pairs allows us to test subjects’ ability to 
develop a robust understanding of Vicki’s behavior across a 
wide variety of inputs. 

Task setup. Each human study consists of 100 QI-pairs 
where a single question is asked across 100 images. 
The motivation behind keeping the question constant is to 
make it easier for the subject to pick up trends in Vicki’s responses 
across images. The annotation task is broken down 
into a train phase where the person is shown 50 QI-pairs, 
and a test phase where we evaluate the subject's performance 
on the remaining 50 QI-pairs. 
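
The task setup above can be sketched as a simple split of one study's 100 QI-pairs into a train phase and a test phase. This is only an illustration: the function name `split_study`, the image identifiers, and the choice to shuffle before splitting are assumptions, since the text does not say how pairs were ordered.

```python
import random


def split_study(qi_pairs, n_train=50, seed=0):
    """Split one study's QI-pairs into train and test phases.

    Shuffling with a fixed seed is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(qi_pairs)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# One question asked across 100 (hypothetical) image ids.
question = "What time is it?"
study = [(question, f"image_{i:03d}") for i in range(100)]

train_phase, test_phase = split_study(study)
print(len(train_phase), len(test_phase))  # 50 50
```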


6.1. Does feedback help? 


To familiarize subjects with Vicki, we provide them with 
instant feedback during the train phase. Immediately after a 
subject responds to a QI-pair, we show them whether Vicki 
actually answered the question correctly or not (in FP) or 
what Vicki's response was (in KP). In the train phase, sub- 
jects are also shown a live score of how well they are do- 
ing and are allowed to scroll through feedback for previous 
images (of course, they are not allowed to change their an- 
swers to previous images). Once training is complete, no 
further feedback (including running score) is provided and 
subjects are asked to draw from the intuition they have built 
in training to best answer all questions in the test phase. 
Subjects are also paid a bonus if they do particularly well. 

To evaluate the role of instant feedback, we have 2 sub- 
jects do our study with and without instant feedback each, 
for each of the questions (7) and each task (FP and KP). 
This results in a total of 28 human studies (with 28 unique 
human subjects). Even without feedback, subjects still go 
through all 100 images. 

In FP, always answering “correctly” would result in an 
accuracy of 58.29%. We find that subjects do slightly better 
and achieve 62.66% accuracy on FP, even without prior familiarity 
with Vicki (No Train). Thus, subjects are already 
slightly better calibrated with an AI's capabilities than unbridled 
optimism (or pessimism). Further, we find that subjects 
who receive training as instant feedback (IF) achieve 
13.09% (absolute) higher mean accuracies than those who 
do not (see Fig. 5; IF vs No Train (No IF) for FP (green)). 
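
The error bars in Fig. 5 are described as 95% confidence intervals from 1000 bootstrap samples. A minimal sketch of one common way to compute such an interval (the percentile bootstrap over per-subject accuracies) is given below. The percentile variant, the function name `bootstrap_ci`, and the toy accuracies are assumptions for illustration and are not taken from the study.

```python
import random


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_boot)
    )
    lo_idx = round(alpha / 2 * n_boot)  # 25 for the defaults
    hi_idx = n_boot - 1 - lo_idx        # 974 for the defaults
    return means[lo_idx], means[hi_idx]


# Hypothetical per-subject test accuracies for one setting.
subject_accuracy = [0.58, 0.62, 0.70, 0.66, 0.74, 0.60, 0.68]
mean_acc = sum(subject_accuracy) / len(subject_accuracy)
low, high = bootstrap_ci(subject_accuracy)
print(f"mean = {mean_acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only a handful of subjects per setting, intervals computed this way are wide, so differences between settings should be read against the error bars rather than the means alone.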

In KP, answering each question with Vicki's most popular 
answer overall (‘no’) would lead to an accuracy of 
13.4%. Additionally, answering each question with Vicki's 
most popular answer for that question⁸ leads to an accuracy 
of 31.43%. Interestingly, subjects who are unfamiliar 
with Vicki (No Train) achieve 21.27% accuracy: better 
than the most popular answer overall prior, but worse than 
the question-specific prior over Vicki's answers. The latter 
is understandable as subjects unfamiliar with Vicki do not 
know which of its 1000 possible answers are more likely a 
priori for each question. 
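
The two priors used as KP baselines can be sketched as below: one predicts Vicki's single most popular answer for every question, the other predicts Vicki's most popular answer for each question separately. The helper names and the toy (question, answer) records are hypothetical; only the idea of the two baselines comes from the text.

```python
from collections import Counter, defaultdict


def most_popular_overall(train):
    """Vicki's single most frequent answer over all (question, answer) pairs."""
    return Counter(answer for _, answer in train).most_common(1)[0][0]


def most_popular_per_question(train):
    """Vicki's most frequent answer for each question separately."""
    by_question = defaultdict(Counter)
    for question, answer in train:
        by_question[question][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in by_question.items()}


def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Toy records of Vicki's answers (hypothetical).
train = (
    [("Is it raining?", "no")] * 3
    + [("What time is it?", "daytime")] * 2
    + [("What time is it?", "night")]
)
test = [
    ("Is it raining?", "no"),
    ("What time is it?", "daytime"),
    ("What time is it?", "night"),
    ("Is it raining?", "yes"),
]
truth = [answer for _, answer in test]

overall = most_popular_overall(train)
per_q = most_popular_per_question(train)

print(accuracy([overall] * len(test), truth))         # 0.25
print(accuracy([per_q[q] for q, _ in test], truth))   # 0.5
```

The per-question prior needs access to Vicki's answer statistics, which is exactly the knowledge an untrained subject lacks.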

We find that mean absolute performance in KP with IF 
is 51.11%, 29.84% (absolute) higher than KP without IF (see Fig. 5; 
IF vs No Train (No IF) for KP (blue)). Subjects thus considerably 
outperform both the 'most popular answer' and 
'most popular answer per question' priors. It is apparent 
that just from a few (50) training examples, subjects learn 
to generalize beyond Vicki's favorites among its vocabulary 
of 1000 answers. Additionally, the 29.84% improvement 
over No Train for KP is significantly larger than that 
for FP (13.09%). This is understandable because a priori 
(No Train setting), KP is a much harder task as compared to 
FP, due to the increased space of possible subject responses 
given a QI-pair, and the combination of relevant and irrelevant 
QI-pairs in the test phase. 

Questions such as ‘Is it raining?’ have strong language 
priors — to these Vicki often defaults to the most popular 
answer (‘no’), irrespective of image. We observe that on 
such questions, subjects perform considerably better in KP 
once they develop a sense for Vicki's inherent bias via in- 
stant feedback. For open-ended questions like ‘What time 
is it?’, feedback helps subjects (1) narrow down the 1000 
potential options to the subset that Vicki typically answers 
with — in this case time periods such as ‘daytime’ rather than 
actual clock times and (2) identify correlations between vi- 
sual patterns and Vicki's answer (as seen in Fig. 2). In other 
cases like ‘How many people are in the image?’ the space 
of possible answers is clear a priori, but after IF subjects 
realize Vicki is not good at detailed counting but does base 
its count predictions on coarse signals of the scene layout. 

⁸ Vicki's most frequent answer (in the train set) to each question is as
follows: What kind of animal is this? (Dog). What time is it? (Daytime).
What are the people doing? (Standing). Is it raining? (No). What room
is this? (Kitchen). How many people are there? (1). What color is the
umbrella? (Black).

In Sec. 3, we described how montages (refer to Fig. 2)
help highlight Vicki's quirks. In order to test the effectiveness
of such montages as a teaching tool, we also experimented
with a modification of the KP + IF setting (two
unique subjects per question participated in this setting, resulting
in an additional 14 human studies). In the train phase
of this new setting, instead of individual images, subjects 
are shown a series of montages, each containing 4 to 16 im- 
ages across which Vicki gave the same answer to the ques- 
tion. The objective remains the same — to guess what that 
answer was (with IF provided after each guess). The test 
phase is kept identical to the KP + IF test phase, with a 
single image per question and no IF. We find that subjects 
achieve 41.6% mean accuracy in the test phase of this set- 
ting, which is lower than the mean accuracy in the test phase 
of the KP + IF setting (51.1%). Interestingly, mean accuracy
in the train phase of the montage setting is 68.7%, significantly
higher than the mean accuracy in the train phase
of the KP + IF setting (49.3%). This seems to indicate that 
while montages make it much easier to guess Vicki's re- 
sponse correctly by picking out patterns (as seen in Fig. 7 
and Fig. 8), the focus on identifying commonalities between 
groups of images interferes with the ability to pick up on 
image-level patterns. As a result, subjects do not general- 
ize well to individual images at test time, resulting in worse 
performance. Keeping the train and test tasks identical (in- 
dividual images in both cases) is more effective. 
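
One way to assemble such montages is to group images by the answer Vicki gave to a question and then cut each group into sets of 4 to 16 images. The sketch below is an assumption about how this could be done, not a description of the pipeline used in the study; the function name `build_montages`, the tuple layout, and the rule of dropping groups smaller than 4 images are hypothetical.

```python
from collections import defaultdict

def build_montages(predictions, min_size=4, max_size=16):
    """Group image ids by (question, Vicki's answer) and split each group
    into montages holding between min_size and max_size images.

    predictions: iterable of (image_id, question, vicki_answer) tuples.
    Groups (or leftover chunks) smaller than min_size are dropped.
    """
    groups = defaultdict(list)
    for image_id, question, answer in predictions:
        groups[(question, answer)].append(image_id)

    montages = []
    for (question, answer), image_ids in groups.items():
        for start in range(0, len(image_ids), max_size):
            chunk = image_ids[start:start + max_size]
            if len(chunk) >= min_size:
                montages.append({"question": question,
                                 "answer": answer,
                                 "images": chunk})
    return montages

# Hypothetical predictions: 20 images answered 'blue', 3 answered 'green'.
preds = [(f"img{i}", "What color is the grass?", "blue") for i in range(20)]
preds += [(f"img{i}", "What color is the grass?", "green") for i in range(20, 23)]
montages = build_montages(preds)
```

Here the 20 'blue' images yield two montages (16 and 4 images), while the 3 'green' images are too few to form one.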

VQA Researchers. Just as an anecdotal point of reference, 
we also conducted experiments with experts with varying
degrees of familiarity with agents like Vicki. We observed
that a VQA researcher had an accuracy of 80% versus a
computer vision (but not VQA) researcher who had 60%
in a shorter version of the FP task without instant feedback.
Clearly, familiarity with Vicki plays a critical role
in how well a human can predict its oncoming failures or 
successes. 


6.2. Do explanation modalities help? 


In this section, we briefly describe the different expla- 
nation modalities that we utilize to expose Vicki's internal 
states to the human subject. In addition to an image and 
question about the image, we also show the subject one of 
the three explanation modalities described below. Subjects 
are asked to use these as hints to perform the task (FP or 
KP) more accurately, and can leverage the training phase 
(with instant feedback and a running score) to learn how to 
best do so. 

We experiment with 3 qualitatively different explanation 
modalities (see Fig 6): 

Confidence of top-5 predictions. We show subjects
Vicki's confidence in its top-5 answer predictions from its
vocabulary as a bar plot⁹. If Vicki is relatively more confident
in its top-1 prediction, it is more likely to be right.
If Vicki is confused about the top-5 predictions, it is more
likely to be wrong.

Attention maps. Recall that Vicki is the
co-attention VQA model proposed by Lu et al. [33] which
jointly reasons about image and question attention (Sec 3).
Thus, along with the image we show subjects the spatial attention
map over the image that indicates the regions that
Vicki is looking at and an attention map over each word of
the question highlighting the relative importance of words
in the question for Vicki, while producing an answer. We
show subjects a legend to interpret what the colors in each
attention map indicate.

Grad-CAM. In contrast to explicit
attention maps described above, we experiment with an implicit
attention map. We use the CNN visualization technique
by Selvaraju et al. [53], using the attention maps corresponding
to Vicki's most confident answer.

⁹ Of course, we don't show the actual top-5 predictions, just the confidence
in the predictions.

Figure 6: Screenshots of the interfaces of different explanation
modalities that we show subjects. [Panels: QI-Attention; Top-5 answer
confidence. Example question: How many people are there?]

We have 2 subjects perform each of our tasks (2) for each 
of the explanation modalities (3) for each question (7) re- 
sulting in a total of 84 tasks (and unique subjects). Across 
all studies (including those described in earlier sections), 
we have collected over 65k responses from 415 unique sub- 
jects. Conducting studies in-house in controlled environ- 
ments at this scale would be prohibitive. 
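
The count of 84 follows directly from the factorial design described above. As a small worked check (variable names are ours; the question wording is taken from footnote 8), the conditions can be enumerated as follows:

```python
from itertools import product

# Factors as described in the text; question wording taken from footnote 8.
tasks = ["FP", "KP"]
modalities = ["top-5 confidence", "attention maps", "Grad-CAM"]
questions = [
    "What kind of animal is this?",
    "What time is it?",
    "What are the people doing?",
    "Is it raining?",
    "What room is this?",
    "How many people are there?",
    "What color is the umbrella?",
]
subjects_per_condition = 2

# One condition per (task, modality, question) combination.
conditions = list(product(tasks, modalities, questions))
total_studies = subjects_per_condition * len(conditions)
print(len(conditions), total_studies)  # 42 84
```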

To put human FP accuracies (using explanation modal- 
ities) in perspective, we experiment with a few automatic 
approaches to detect Vicki’s failure from its internal states. 
We find that a decision stump on Vicki’s confidence in its 
top answer or the entropy of its 1000-way softmax output 
results in FP accuracy of 60% on our test set. We train 
a Multilayer Perceptron (MLP) neural network on Vicki’s 
output softmax and predict success vs failure. This achieves 
an FP accuracy of 81%¹⁰. Training an MLP which takes as
input question features (average word2vec embeddings [37]
of words in the question) concatenated with image features
(fc7 from VGG-19) to predict success vs failure (which we
call ALERT following [68]) achieves an FP accuracy of
65%. Note that these methods are trained on about 66%
of the VQA-val set (~81k examples, rest used for validation).
Human subjects are trained on only 50 examples. We
only report machine results to put human accuracies in perspective.
We do not draw any inferences about the relative
capabilities of both.

¹⁰ Showing a visualization of this score to a human may make for a good
"explanation modality" for FP! Exploring this is part of future work.
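
To make the simplest of these automatic baselines concrete, the sketch below fits a decision stump (a single threshold) on either Vicki's top-answer confidence or the entropy of its softmax output. It is a minimal illustration under stated assumptions: the function names, the exhaustive threshold search, and the toy 4-way softmax vectors are ours and not the implementation used in the study, where the output is 1000-way.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def fit_stump(scores, labels):
    """Find the threshold and direction on a 1-D score that best separate
    successes (True) from failures (False) on the training data.

    Returns (threshold, predict_success_if_above, train_accuracy).
    """
    best = (None, True, -1.0)
    for t in sorted(set(scores)):
        for above in (True, False):
            preds = [(s >= t) == above for s in scores]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best[2]:
                best = (t, above, acc)
    return best

def predict_stump(stump, score):
    """Predict success (True) or failure (False) for a single score."""
    threshold, above, _ = stump
    return (score >= threshold) == above

# Hypothetical 4-way softmax outputs and whether Vicki was correct.
softmaxes = [
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.20, 0.05, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
correct = [True, True, False, False]

conf_stump = fit_stump([max(p) for p in softmaxes], correct)
ent_stump = fit_stump([entropy(p) for p in softmaxes], correct)
```

On this toy data the confidence stump predicts success when confidence is at or above its threshold, while the entropy stump learns the opposite direction (low entropy means success). The toy data is perfectly separable, unlike the real setting where such stumps reach about 60% FP accuracy.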

Accuracies of subjects in the test phase of both tasks (FP 
and KP) for different settings of the explanation modalities
are summarized in Fig. 5. Recall, all studies that include
an explanation modality also include instant feedback¹¹
(IF) and a running score during training. For reference,
we also show performance of subjects with no explanation
modality both with and without IF. We observe that
on both tasks, subjects shown explanation modalities along 
with IF show no statistically significant improvement in per- 
formance over those shown just IF. In fact, in some cases 
performance is worse. While piloting these tasks ourselves, 
we found that it was easy to "overfit" to the explanation 
modalities and hallucinate patterns when none may exist. 
While the works introducing some of these modalities as- 
sessed their interpretability qualitatively or measured their 
role in improving human trust, our preliminary hypothesis 
is that these modalities may not yet help human-AI teams 
be more accurate in a goal-driven collaborative setting because
they do not yet help humans predict the AI's behavior
more accurately.


7. Conclusion 


We posit that as AI makes progress, human-AI teams are 
imperative. We argue that for these teams to be effective, 
it is not only important for the AI to be capable of mod- 
eling the intentions, beliefs, strengths and weaknesses of 
the human, but also for the human to build a Theory of the 
AI's Mind (ToAIM). Take-home message #1: We should
pursue research directions to help humans build models of 
the strengths, weaknesses, quirks, and tendencies of AI. We 
instantiate these ideas in the domain of Visual Question An- 
swering (VQA). We propose two tasks that help measure the 
extent to which a human “understands” a VQA model (we 
call Vicki) — Failure Prediction (FP) and Knowledge Predic- 
tion (KP) – where given an input instance (question-image 
pair) the human has to predict whether Vicki will answer 
the question correctly or not, and what Vicki's exact answer 
will be. We evaluate the roles that familiarity with Vicki 
and explanation modalities that expose the internal states 
of Vicki play. Take-home message #2: Lay people indeed 
get better at predicting Vicki's behavior using just a few 
(50) "training" examples. Take-home message #3: Surprisingly,
existing explanation modalities that are popular
in computer vision do not help make Vicki's failures more
predictable. In fact, humans seem to overfit to the additional
information provided and perform slightly worse at KP in
the presence of explanation modalities. Take-home message #4:
Clearly, much work remains to be done in developing
improved explanation modalities that do in fact help
make AI more predictable to a human.

¹¹ In real-world settings, we consider familiarizing via instant-feedback,
followed by showing explanation modalities, as the natural progression for
acquainting subjects with Vicki. Hence, we evaluate the role explanation
modalities play on top of instant feedback. Nevertheless, for sake of completeness,
studying the effect of showing explanation modalities on subject
performance, independent of instant feedback, is part of future work.

As AI improves, does a user's ToAIM become outdated?
In informal studies, we found human FP/KP performance 
with one VQA model to generalize to another. Further, we 
envision ToAIM building being a continual exercise inte- 
grated in the deployment of AI, with every major update 
involving a familiarization stage. 

This work just scratches the surface, and numerous av- 
enues of further exploration remain. Studying other vision 
models (AI agents in general) at varying points on the inter- 
pretability vs performance spectrum for other tasks, evalu- 
ating other existing explanation modalities, and conducting 
human studies at an even larger scale are natural extensions. 
Relevant to the increased interest in building interpretable 
models, this work presents novel opportunities to evaluate 
explanation modalities grounded in specific tasks (FP and 
KP). Finally, it would be exciting to close the loop and 
evaluate the extent to which improved human performance 
at FP and KP translates to improved success of human-AI 
teams at accomplishing a shared goal. Co-operative human-AI
games may be a natural fit for such evaluation.

Appendix

A. Introduction 


This supplementary material is organized as follows: We 
first discuss various visual recognition scenarios in which a 
human might rely on an AI, and motivate the need for The- 
ory of AI's Mind (ToAIM) in those scenarios. Next, we 
include video demonstrations of our FP and KP interfaces. 
We then provide more qualitative examples of montages 
(introduced in Fig. 2 of main paper) that highlight Vicki's
quirks, and additionally share insights on Vicki from sub- 
jects who completed the tasks. Finally, we describe an AMT 
survey we conducted to gauge public perception of AI, and 
provide a list of questions and analysis of results. 


B. Visual Recognition Scenarios and Applica- 
bility of ToAIM 


In general, one might wonder why a human would need 
Vicki to answer questions if they are already looking at the 
image. This may be true for the VQA dataset, but outside 
of that there are scenarios where the human either does not 
know the answer to a question of interest (e.g., the species 
of a bird), or the amount of visual data is so large (e.g., long 
surveillance videos) that it would be prohibitively cumber- 
some for them to sift through it. Note that even in this sce- 
nario where the human does not know the answer to the 
question, a human who understands Vicki’s failure modes 
from past experience would know when to trust its decision. 
For instance, if the bird is occluded, or the scene is clut- 
tered, or the lighting is bad, or the bird pose is odd, Vicki 
will fail. Moreover, the idea of humans predicting the AI's
failure (and ToAIM in general) also applies to other scenarios
where the human may not be looking at the image,
and hence needs to work with Vicki (e.g., blind user, or a 
human working with a tele-operated robot). In these cases 
too, it would be useful for the human to have a sense for 
the contexts and environments and/or kinds of questions for 
which Vicki can be trusted. In this work, as a first step, we 
focus on the first scenario where the human is looking at the 
image and a question while predicting Vicki’s failures and 
responses. 


C. Interfaces 


To enable readers to experience the FP and KP tasks first- 
hand, we provide a link to the interfaces we used to train 
subjects: https://deshraj.github.io/TOAIM/. 
We also provide links to videos demonstrating each task: 
FP - https://youtu.be/Dcs7GOmTAns and KP -
https://youtu.be/f_likwCuG4Q. Note that for il- 
lustration, we show just a single setting of the respective 
task in each interface and video. 


D. Vicki's Quirks 


We present more examples in Fig. 7 and Fig. 8 that high- 
light Vicki's quirks. Recall that there are several factors 
which lead to Vicki being quirky, many of which are well 
known in VQA literature [1]. As we can see across both 
examples, Vicki exhibits these quirks in a somewhat pre- 
dictable fashion. At first glance, the primary factors that 
seem to decide Vicki's response to a question given an 
image are the properties and activities associated with the 
salient objects in the image, in combination with the lan- 
guage and the phrasing of the question being asked. This is 
evident when we look across the images (see Fig. 7 and 8) 
for question-answer (QA) pairs such as — What are the peo- 
ple doing? Grazing, What is the man holding? Cow and Is 
it raining? No. As a specific example, notice the images 
for the QA pair What color is the grass? Blue (see Fig. 7) — 
Vicki's response to this question is the most dominant color 
in the scene across all images even though there is no grass 
present in any of them. Similarly, for the QA pair What 
does the sign say? Banana (see Fig. 8) — Vicki's answer is 
the salient object across all the scenes. 

Interestingly, some subjects did try and pick up on some 
of the quirks and beliefs described previously, and formed a 
mental model of Vicki while completing the Failure Predic- 
tion or Knowledge Prediction tasks. We asked subjects to 
leave comments after completing a task and some of them 
shared their views on Vicki's behavior. We share some of 
those comments below. 

The abbreviations used are Failure Prediction (FP), Knowl- 
edge Prediction (KP) and Instant Feedback (IF). 


1. FP

• "These images were all pretty easy to see what animal
it was. I would imagine the robot would be able to get
90% of the animals correct, unless there were multiple
animals in the same photo."

• "I think the brighter the color the more likely they are
to get it right. Multi-colored, not so sure."

• "I'd love to know the answers to these myself."

2. FP + IF

• "This is fun, but kind of hard to tell what the hints
mean. Can she determine the color differences in
multi-colored umbrellas or are they automatically
marked wrong because she only chooses one color instead
of all of the colors? It seems to me that she just
goes for the brightest color in the pic. This is very
interesting. Thank you! :)"

• "I didn't quite grasp what the AI's algorithm was for
determining right or wrong. I want to say that it was if
the AI could see the face of the animal then it guessed
correctly, but I'm really not sure."


3. FP + IF + Explanation Modalities

• "Even though Vicki is looking at the right spot doesn't
always mean she will guess correctly. To me there was
no rhyme or reason to guessing correctly. Thank you."

• "I think she can accurately know a small number of
people but cannot know a huge grouping yet."

• "I would be more interested to find out how Vickis metrics
work. What I was assuming is just color phase
and distance might not be accurate."


4. KP

• "Time questions are tricky because all Vicki can do is
round to the nearest number."

• "there were a few that seemed like it was missing obvious
answers - like bus and bus stop but not bus station.
Also words like lobby seemed to be missing."


5. KP + IF

• "Interesting, though it seems Vicki has a lot more
learning to do. Thank you!"

• "This HIT was interesting, but a bit hard. Thank you
for the opportunity to work this."


6. KP + IF + Explanation Modalities

• "You need to eliminate the nuances of night time and
daytime from the computer and choose one phrasing
"night" or "day" Vicki understands. The nuance
keeps me and I'm sure others obtaining a higher score
here on this task."

• "I felt that Vickie was mistaken as to what some colors
were for the first test which probably carried over and
I tried my best to recreate her responses."


7. KP + IF + Montages

• "I am not sure that I ever completely understood how
Vicki thought. It seemed it had more to do with what
was in the pictures instead of the time of day it looked
in the pictures. If there was food, she chose noon or
morning, even though at times it was clearly breakfast
food and she labeled it noon."

• "It doesn't seem very accurate as I made sure to count
and took my time assessing the pictures."

• "it is hard to figure out what they are looking for since
there isn't many umbrellas in the pictures"


On a high-level reading through all comments, we found 
that subjects felt that Vicki's response often revolves around 
the most salient object in the image, that Vicki is bad at 
counting, and that Vicki often responds with the most dom- 
inant color in the image when asked a color question. In 
Fig. 9a, we show a word cloud of all the comments left by 
the subjects after completing the tasks. From the comments, 
we observed that subjects were very enthusiastic to famil- 
iarize themselves with Vicki, and found the process engag- 
ing. Many thought that the scenarios presented to them were 
interesting and fun, despite being hard. We used some ba- 
sic elements of gamification, such as performance-based re- 
ward and narrative, to make our tasks more engaging; we 
think the positive response indicates the possibility of mak- 
ing such human-familiarization with AI engaging even in 
real-world settings. 


E. Perception of AI 


Before introducing people to Vicki and gauging their ex- 
pectations from a modern VQA system, we attempted to 
assess their general impressions of present-day AI. 

We asked each subject to fill out a survey with questions 
aimed to collect three types of information: 

1. Background information. We collected basic demo- 
graphic information such as age, gender, educational qual- 
ifications, type of residential area, and profession. We also 
collected socio-economic background information such as 
employment status and income group. 

2. Familiarity with computers and AI. We asked subjects 
if their jobs involved computers, if they knew how to pro- 
gram, how much time they spent in front of a computer 
or smartphone, and their familiarity with popular AI assis- 
tants such as Siri, Alexa and Google Assistant. We also 
asked if they were aware of recent advances in AI, espe- 
cially those trending in popular media, such as Watson [22], 
AlphaGo [54], machine learning, and deep learning. 


Figure 7: Given a question (red) we show images for which Vicki gave the same answer (blue) to the question to observe Vicki's quirks.

Figure 8: Given a question (red) we show images for which Vicki gave the same answer (blue) to the question to observe Vicki's quirks.

3. Estimates of AI's capabilities. We asked subjects their
duration and source of exposure to AI and gathered their 
impressions on the capabilities of modern-day AI systems 
on a range of tasks. We also asked them about their under- 
standing, trust and sentiment towards modern AI systems, 
as well as their expectations and predictions for AI in the 
future. 


In Fig. 10, 11 and 12, we break down the 321 subjects 
that completed the survey by their response to each ques- 
tion. 


As part of the survey, subjects were also asked a few 
subjective questions about their opinions on present-day 
AI's capabilities. Specifically, they were asked to list tasks
that they thought AI is capable of performing today (see 
Fig. 9b), will be capable of in the next 3 years (see Fig. 9c), 
and will be capable of in the next 10 years (see Fig. 9d). 
We also asked how they think AI works (see Fig. 9e). In 
Fig. 9b, 9c and 9d, we show word clouds corresponding to 
what subjects thought about the capabilities of AI. We also 
share some of those responses below. 


1. Name three things that you think AI today can do. Pre- 
dict sports games; Detect specific types of cancer in images; 
Control house temp based on outside weather; translate; 
calculate probabilities; Predictive Analysis; AI can predict 
future events that happen like potential car accidents; lip 
reading; code; Facial recognition; Drive cars; Play Go; 
predict the weather; Hold a conversation; Be a personal 
assistant; Speech recognition; search the web quicker. 
2. Name three things that you think AI today can't yet do 
but will be able to do in 3 years. Fly planes; Judge emo- 
tion in voices; Predict what I want for dinner; perform 
surgery; drive cars; manage larger amounts of information 
at a faster rate; think independently totally; play baseball; 
drive semi trucks; Be a caregiver; anticipate a person's ly- 
ing ability; read minds; Diagnose patients; improve robots 
to walk straight; Run websites; solve complex problems like 
climate change issues; program other ai; guess ages; form 
conclusions based on evidence; act on more complex com- 
mands; create art. 
3. Name three things that you think AI today can't yet do 
and will take a while (> 10 years) before it can do it. Imitate 
humans; be indistinguishable from humans; read minds; 
Have emotions; Develop feelings; make robots act like hu- 
mans; truly learn and think; Replace humans; impersonate 
people; teach; be a human; full AI with personalities; Run 
governments; be able to match a human entirely; take over 
the world; Pass a Turing test; be a human like friend; inti- 
macy; Recognize things like sarcasm and humor. 

Interestingly, we observe a steady progression in subjects'
expectations of AI's capabilities, as the time span increases.
On a high-level reading through the responses, we
notice that subjects believe that AI today can successfully
perform tasks such as machine translation, driving vehicles,
speech recognition, analyzing information and drawing
conclusions, etc. (see Fig. 9b). It is likely that this is
influenced by the subjects’ exposure to or interaction with 
some form of AI in their day-to-day lives. When asked 
about what AI can do three years from now, most subjects 
suggested more sophisticated tasks such as inferring emo- 
tions from voice tone, performing surgery, and even deal- 
ing with climate change issues (see Fig. 9c). However, the 
most interesting trends emerge while observing subjects' 
expectation of what AI can achieve in the next 10 years 
(see Fig. 9d). A major proportion of subjects believe that 
AI will gain the ability to understand and emulate human 
beings, teach human beings, develop feelings and emotions 
and pass the Turing test. 


We also observe how subjects think AI works (see 
Fig. 9e). Mostly, subjects believe that an AI agent today 
is a system with high computational capabilities that has 
been programmed to simulate intelligence and perform cer- 
tain tasks by exposing it to huge amounts of information, 
or, as one of the subjects phrased it: broadly AI recognizes
patterns and creates optimal actions based on those patterns
towards some predefined goals. In summary, it appears
that subjects have high expectations from AI, given
enough time. While it is uncertain at this stage how many, or
how soon, these feats will actually be achieved, we believe 
that building Theory of AI’s mind skills will help humans 
generally become more active and effective collaborators in 
human-AI teams. 

As an interesting tidbit: Fig. 12 shows the percentage of subjects
who think certain tasks are "solved". 80-90% of the subjects
think AI today can recognize faces and infer your
mood from social media posts. 65-70% of subjects think AI 
today can recognize handwriting, be creative (write, com- 
pose, draw), or drive a car. They are more split on whether 
AI can describe an image in a sentence. However, most 
(96%) agree that AI today cannot read our minds! Inter- 
estingly, 62% of subjects think that AI can become smarter 
than the smartest human. 

We now provide a full list of all questions in the survey. 
1. How old are you? 

(a) Less than 20 years 
(b) Between 20 and 40 years 
(c) Between 40 and 60 years 
(d) Greater than 60 years 
2. What is your gender? 
(a) Male 
(b) Female 
(c) Other 
3. Where do you live? 
(a) Rural 
(b) Suburban 
(c) Urban 
4. Are you? 


(a) A student 
(b) Employed 
(c) Self-employed 
(d) Unemployed 
(e) Retired 
(f) Other 
5. To which income group do you belong? 
(a) Less than 5000$ per year 
(b) 5,000-10,000$ per year 
(c) 10,000-25,000$ per year 
(d) 25,000-60,000$ per year 
(e) 60,000-120,000$ per year 
(f) More than 120,000$ per year 
6. What is your highest level of education? 
(a) No formal education 
(b) Middle School 
(c) High School 
(d) College (Bachelors) 
(e) Advanced Degree 
7. What was your major? 
(a) Computer Science / Computer Engineering 
(b) Engineering but not Computer Science 
(c) Mathematics / Physics 
(d) Philosophy 
(e) Biology / Physiology / Neurosciences 
(f) Psychology / Cognitive Sciences 
(g) Other Sciences 
(h) Liberal Arts 
(i) Other 
(j) None
8. Do you know how to program / code? 
(a) Yes 
(b) No 
9. Does your full-time job involve: 
(a) No computers 
(b) Working with computers but no programming / cod- 
ing? 
(c) Programming / Coding 
10. How many hours a day do you spend on your computer 
/ laptop / smartphone? 
(a) Less than 1 hour 
(b) 1-5 hours 
(c) 5-10 hours 
(d) Above 10 hours 
11. Do you know what Watson is in the context of Jeop- 
ardy? 
(a) Yes 
(b) No 
12. Have you ever used Siri, Alexa, or Google Now/Google 
Assistant? 
(a) Yes 
(b) No 
13. How often do you use Siri, Alexa, Google Now, Google 


Assistant, or something equivalent? 
(a) About once every few months 
(b) About once a month 
(c) About once a week 
(d) About 1-3 times a day 
(e) More than 3 times a day 
14. Have you heard of AlphaGo? 
(a) Yes 
(b) No 
15. Have you heard of Machine Learning? 
(a) Yes 
(b) No 
16. Have you heard of Deep Learning? 
(a) Yes 
(b) No 
17. When did you first hear of Artificial Intelligence (AI)? 
(a) I have not heard of AI
(b) More than 10 years ago 
(c) 5-10 years ago 
(d) 3-5 years ago 
(e) 1-3 years ago 
(f) In the last six months 
(g) Last month 
18. How did you learn about AI? 
(a) School / College 
(b) Conversation with people 
(c) Movies 
(d) Newspapers 
(e) Social media 
(f) Internet 
(g) TV 
(h) Other 
19. Do you think AI today can drive cars fully au- 
tonomously? 
(a) Yes 
(b) No 
20. Do you think AI today can automatically recognize 
faces in a photo? 
(a) Yes 
(b) No 
21. Do you think AI today can read your mind? 
(a) Yes 
(b) No 
22. Do you think AI today can automatically read your 
handwriting? 
(a) Yes 
(b) No 
23. Do you think AI today can write poems, compose music,
make paintings?
(a) Yes 
(b) No 
24. Do you think AI today can read your Tweets, Facebook 
posts, etc. and figure out if you are having a good day or 


not? 

(a) Yes 

(b) No 
25. Do you think AI today can take a photo and automati- 
cally describe it in a sentence? 

(a) Yes 

(b) No 
26. Other than those mentioned above, name three things 
that you think AI today can do. 
27. Other than those mentioned above, name three things 
that you think AI today can't yet do but will be able to do in 
3 years. 
28. Other than those mentioned above, name three things 
that you think AI today can't yet do and will take a while 
(> 10 years) before it can do it. 
29. Do you have a sense of how AI works? 

(a) Yes 

(b) No 

(c) If yes, describe in a sentence or two how AI works. 
30. Would you trust an AT's decisions today? 

(a) Yes 

(b) No 
31. Do you think AI can ever become smarter than the 
smartest human? 

(a) Yes 

(b) No 
32. If yes, in how many years? 

(a) Within the next 10 years 

(b) Within the next 25 years 

(c) Within the next 50 years 

(d) Within the next 100 years 

(e) In more than 100 years 
33. Are you scared about the consequences of AI? 

(a) Yes 

(b) No 

(c) Other 

(d) If other, explain.

Figure 10: Population Demographics (across 321 subjects)

Figure 11: Technology and AI exposure (across 321 subjects)

Figure 12: Perception of AI (across 321 subjects)